PUPpy: a primer design pipeline for substrain-level microbial detection and absolute quantification

ABSTRACT Characterizing microbial communities at high resolution and with absolute quantification is crucial to unravel the complexity and diversity of microbial ecosystems. This can be achieved with PCR assays, which enable highly selective detection and absolute quantification of microbial DNA. However, a major challenge that has hindered PCR applications in microbiome research is the design of highly specific primer sets that exclusively amplify intended targets. Here, we introduce Phylogenetically Unique Primers in python (PUPpy), a fully automated pipeline to design microbe- and group-specific primers within a given microbial community. PUPpy can be executed from a user-friendly graphical user interface, or two simple terminal commands, and it only requires coding sequence files of the community members as input. PUPpy-designed primers enable the detection of individual microbes and quantification of absolute microbial abundance in defined communities below the strain level. We experimentally evaluated the performance of PUPpy-designed primers using two bacterial communities as benchmarks. Each community comprises 10 members, exhibiting a range of genetic similarities that spanned from different phyla to substrains. PUPpy-designed primers also enable the detection of groups of bacteria in an undefined community, such as the detection of a gut bacterial family in a complex stool microbiota sample. Taxon-specific primers designed with PUPpy showed 100% specificity to their intended targets, without unintended amplification, in each community tested. Lastly, we show the absolute quantification of microbial abundance using PUPpy-designed primers in droplet digital PCR, benchmarked against 16S rRNA and shotgun sequencing. Our data shows that PUPpy-designed microbe-specific primers can be used to quantify substrain-level absolute counts, providing more resolved and accurate quantification in defined communities than short-read 16S rRNA and shotgun sequencing. IMPORTANCE Profiling microbial communities at high resolution and with absolute quantification is essential to uncover hidden ecological interactions within microbial ecosystems. Nevertheless, achieving resolved and quantitative investigations has been elusive due to methodological limitations in distinguishing and quantifying highly related microbes. Here, we describe Phylogenetically Unique Primers in python (PUPpy), an automated computational pipeline to design taxon-specific primers within defined microbial communities. Taxon-specific primers can be used to selectively detect and quantify individual microbes and larger taxa within a microbial community. PUPpy achieves substrain-level specificity without the need for computationally intensive databases and prioritizes user-friendliness by enabling both terminal and graphical user interface applications. Altogether, PUPpy enables fast, inexpensive, and highly accurate perspectives into microbial ecosystems, supporting the characterization of bacterial communities in both in vitro and complex microbiota settings.

advances in sequencing technologies have yielded an outstanding increase in assembled microbial genomes, thus enabling the design of unique phylogenetic markers across the entire genetic material and increasing the resolution achievable.
Currently, several tools exist to design primers specific to input genomic sequences, including SpeciesPrimer (29), find_differential_primers (fdp) (30), RUCS (31), and TOPSI (32).However, these tools have either been discontinued, require significant manual handling to generate configuration files, or cannot flexibly design both microbe-and group-specific primers for any user-defined combination of microbes.Another previously published primer design toolset is DECIPHER (33), which offers multiple functionalities for oligo design, including primer design coupled with in silico PCR steps.However, this feature requires pre-aligned DNA sequences as input, which complicates the primer design process and detracts from its efficiency.Altogether, these challenges highlight the need for a user-friendly, streamlined, high-throughput, and highly selective taxonspecific primer design tool.
Here, we introduce Phylogenetically Unique Primers in python (PUPpy), a fully automated pipeline to design primers targeting individual microbes or groups of microbes within a given microbial community.PUPpy only requires coding sequences (CDSs) of the microbes of interest as input to design primers with substrain-level specificity.The pipeline streamlines primer design by only requiring two commands, which can be run either from the terminal or from an intuitive graphical user inter face (GUI).We benchmarked PUPpy-designed primers against two defined bacterial communities, each consisting of 10 members with varying degrees of genetic similar ity, ranging from distinct phyla to substrains.We also evaluated the ability of PUPpydesigned primers to detect members of a highly diverse and fastidious gut bacterial family, the Muribaculaceae, in a complex and undefined conventional mouse microbiota.We show that all taxon-specific primers designed with PUPpy selectively amplified the respective targets in all communities tested.Finally, we assessed the effectiveness of PUPpy-designed primers for accurately quantifying the total microbial count in specific communities.This assessment was conducted using ddPCR and was compared with results obtained from 16S rRNA and shotgun sequencing methods.Our data suggest that taxon-specific primers enable more resolved and accurate quantification than short-read 16S rRNA and shotgun sequencing in defined microbial communities.Altogether, PUPpy represents an intuitive tool that enables absolute quantification of individual microbial taxa, an essential parameter that has been missing in microbiome studies.The versatility of PUPpy makes it suitable for a variety of applications involv ing both defined and complex communities, ranging from environmental studies to host-associated microbiota research.

Overview of the PUPpy pipeline
PUPpy is a fully automated computational pipeline that allows the design of taxon-spe cific PCR primers targeting user-defined microbial communities (Fig. 1A).This pipeline defines two primary applications of taxon-specific primer design: microbe-and groupspecific primers (Fig. 1B).Microbe-specific primers target unique CDSs only present in a single member of the community.These primers allow users to target individual taxa with full specificity, a crucial aspect in tracking and quantifying specific microbes in communities with high degrees of genetic similarity.Conversely, group-specific primers are designed to select genes that are shared across all user-defined targets (Fig. 1B).Importantly, these primers enable the selective assessment of microbial dynamics of an entire taxon (e.g., all Bacteroides) in a sample.In both microbe-and group-spe cific applications, PUPpy designs taxon-specific primers by aligning every CDS in the community and identifying unique (microbe-specific) or shared (group-specific) genes.Because PUPpy only requires CDS files as input, it can run locally in most cases and does not require memory-intensive databases (e.g., the NCBI nr database).More details on the computational workflow are provided in the Materials and methods section.
To demonstrate the applicability of PUPpy in microbiome research, we empirically validated taxon-specific primers in three increasingly complex microbial communities.These communities were composed of (i) a defined community of 10 phylogenetically distinct bacteria, (ii) a defined community of 10 bacteria from three increasingly related taxa, and (iii) a complex microbiota from conventional mouse fecal samples.By gradually increasing the genetic similarity and complexity of the community, we were able to assess the resolution limits and effectiveness of the pipeline under diverse microbial community scenarios.

FIG 1
The PUPpy pipeline designs taxon-specific primers in defined microbial communities.(A) Overview of the PUPpy workflow.Input CDS files for primer target(s) and non-targets (the background) are aligned using MMseqs2 (34).Candidate unique or shared genes are selected to design microbe-and group-specific primers, respectively, with Primer3 (35).(B) The key output of PUPpy is taxon-specific primers, which include microbe-and group-specific primers.
Microbe-specific primers selectively target individual members of the community, while group-specific primers target user-determined collections of microbes.

Microbe-specific primers show specificity to intended targets in a genetically diverse community
The first community, hereafter referred to as the "GUT community", was composed of 10 phylogenetically distinct bacteria belonging to 9 families and 5 phyla (Bacil lota, Bacteroidota, Pseudomonadota, Actinomycetota, and Verrucomicrobiota; Table S1).These bacteria were chosen as being phylogenetically distinct, common gut commen sals, and representative of the five most abundant phyla of the human gut microbiota.This community, while simple, accurately represents a defined microbial consortium similar to that represented in several gnotobiotic animal models (36)(37)(38).Notably, each member of this community possesses a significant number of unique genes and has been previously characterized (36,39,40).To confirm the genetic diversity within the GUT community, we calculated the average nucleotide identity (ANI) on the genomic assemblies for the 10 bacteria (see Materials and methods).The community included two Bacteroides species, Bacteroides ovatus and Bacteroides thetaiotaomicron, which, as expected, shared a greater degree of genetic similarity, with 80% identity and ~50% coverage (Fig. 2A).All other GUT members displayed greater genetic diversity, with <75% identity and <8.5% coverage (Fig. 2A).Following ANI analysis, we ran PUPpy with default parameters and generated microbe-specific primers for all 10 GUT community members (Table S2).Consistent with the degree of genetic diversity observed in the ANI analysis, PUPpy identified fewer unique CDSs in more genetically similar organisms.PUPpy orders primer targets based on descending number of unique genes found, enabling an immediate visual and quantitative evaluation of commun ity diversity (Fig. 2B).Specifically, compared to all other community members, PUPpy identified ~27% of the CDSs as unique in B. ovatus and ~26% in B. thetaiotaomicron, the most related microbes, as opposed to ~74% in E. coli BW25113 (Fig. 2B).Importantly, PUPpy considers all unique CDSs to design microbe-specific primers, and multiple primers can be designed on the same CDS.Thus, even in the organism with the fewest unique genes, B. thetaiotaomicron, PUPpy evaluated 1,248 options (i.e., ~26%) for primer design (Fig. 2B).Next, we empirically validated the specificity of the GUT microbe-specific primers by PCR and gel electrophoresis using genomic DNA (gDNA) isolated from the 10 GUT community members (see Materials and methods).Each primer pair was tested against (i) the respective gDNA positive control, (ii) the 10-member gDNA pool, (iii) a 9-member gDNA pool including all taxa except the primer target, and (iv) a no-template control (Fig. 2C).Following this experimental validation, we found that all microbe-spe cific primers were specific to the respective primer target, without unintended amplification (Fig. 2D; Fig. S1).PUPpy can, therefore, be leveraged to design microbe-specific primers and selectively detect individual microbes in a genetically diverse and defined microbial community.

Taxon-specific primers selectively detect microbes down to the substrain level
As microbial communities often harbor multiple species and strains within the same genus, we next asked whether PUPpy could be used to design taxon-specific primers and detect highly related microbes.To test this, we created a second 10-member bacterial community, hereafter referred to as the "Species-Strain-Substrain (SSS) community" (Table S1).This community consisted of three Enterocloster species, three B. thetaiotao micron strains, and four E. coli K-12 substrains.These bacteria were chosen to assess the taxonomic resolution achievable by PUPpy, while also maintaining a degree of phylogenetic diversity across taxa.To evaluate the genetic similarity in the community, we performed ANI analysis, which grouped the SSS community members into three distinct memberships consistent with their assigned taxonomic classifications (Fig. 3A).As expected, the Enterocloster species displayed the lowest similarity, with a minimum of ~78% identity and ~34% coverage.Conversely, the closely related E. coli substrains were more genetically similar, sharing >99.95% identity and ~99% coverage (Fig. 3A).Next, we ran PUPpy with default parameters to design both microbe-specific primers for each member and group-specific primers for each taxon (Table S2).Group-specific primers are designed on genes shared across all primer targets (e.g., all Enterocloster species) but missing in all unintended targets (e.g., all B. thetaiotaomicron strains and E. coli substrains), enabling the selective detection of microbial taxa within a micro bial community.Consistent with the ANI analysis and taxonomic classification, PUPpy ordered the 10 members into 3 groups based on the number of unique CDSs found (Fig. 3B).Specifically, PUPpy identified between ~19% and ~36% unique CDSs for the three Enterocloster species, between ~7% and ~17% for the B. thetaiotaomicron strains, and between 0% and ~1% for the E. coli K-12 substrains (Fig. 3B).Using default parameters, here the percentage of two genomes that are aligned.ANI analysis, assessing genetic similarity based on percentage identity and coverage, grouped the Enterocloster species, B. thetaiotaomicron strains, and E. coli K-12 substrains as three distinct clusters consistent with their taxonomic assignment.Enterocloster species displayed lower genetic similarity, with a minimum of ~78% identity and ~34% coverage, while E. coli substrains were almost genetically identical, sharing >99.95% identity and ~99% coverage.(B) PUPpy-generated bar plot.The number of unique genes identified by PUPpy in SSS community members decreases across increasingly related taxa.Between 19% and 36% of the CDSs in Enterocloster species are unique within the SSS community, while only 0%-1% of CDSs are unique across E. coli K-12 substrains.(C) Experimental conditions for validation of microbe-and group-specific primers specificity in the SSS community.Group-specific primers were not validated against the 10-member community (i.e., experimental condition ii).(D) Binary heatmap showing selective amplification by both microbe-and group-specific primer pairs down to the substrain level.Microbe-specific primer pairs could not be designed and validated for E. coli K-12 MC4100 because PUPpy could not identify unique genes for this organism.Individual gel imaging data can be found in Fig. S2.(E) Experimental conditions for validation of Muribaculaceae-specific primer specificity in a complex microbial community from conventional mouse fecal samples.(F) Muribaculaceae quantification by 16S rRNA sequencing of five Muribaculaceae-depleted [polyethylene glycol (PEG)-treated] (iii) and five Muribaculaceae-positive (untreated) (iv) biological replicates.Relative abundance was calculated using QIIME2 (41) (see Materials and methods section).
PUPpy did not identify any unique CDSs for E. coli MC4100 within the SSS community, and thus, no microbe-specific primers were designed or tested for this organism (Fig. 3B).The lack of unique genes is likely due the close genetic similarity to all three other E. coli members.Increasing the percentage identity threshold in "puppy-align" may yield unique genes for E. coli MC4100 by decreasing the alignment stringency (see Materials and methods for details).We validated the specificity of SSS microbe-specific primers following the same experimental design as the GUT community (Fig. 3C).In addition, we tested the specificity of group-specific primers by running each pair against (i) the respective primer gDNA targets, [condition (ii) was not run as it would provide redundant information], iii) the nine-member SSS gDNA pool including all taxa except the target strain gDNAs, and (iv) a no-template control (Fig. 3C).All microbe-and group-specific primers tested showed specificity to their intended targets, independent of taxonomic level and genetic similarity (Fig. 3C and D; Fig. S2).These data show that PUPpy enables the selective detection of microbes down to the substrain level by identifying and designing PCR primers for unique CDSs.In addition, this validation highlights the ability of PUPpy to identify conserved CDSs across specific groups of bacteria, allowing the selective amplification of multiple organisms with a single primer pair.

Group-specific primers selectively detect targets in a complex microbial community
Having validated the specificity of PUPpy-designed primers in defined microbial communities, we asked whether the pipeline could also be applied to complex communities in which the exact membership is not known a priori.To investigate whether we could selectively identify the presence of specific taxa in this setting, we leveraged a conventional mouse gut microbiota.The mouse microbiota has a high abundance of the family Muribaculaceae, a highly diverse and hard-to-culture bacterial family that is broadly prevalent in warm-blooded animals (42)(43)(44).We have previously shown that mice treated with the osmotic laxative polyethylene glycol (PEG) are depleted of Muribaculaceae (45), and we wanted to ascertain whether group-spe cific primers to this family could identify the loss of this family without the need for sequencing.We extracted gDNA from fecal samples of mice that were either untreated (Muribaculaceae-positive) or PEG-treated (Muribaculaceae-depleted; see Materials and methods).We confirmed Muribaculaceae presence or absence by performing 16S rRNA sequencing on the fecal gDNA samples, and then evaluated PUPpy-designed primers against the Muribaculaceae family.Such experimental validation presents two main challenges in the design of specific primers: (i) the overall composition of the com munity was treated as unknown a priori, and therefore, we did not have a complete list of CDS files to input into PUPpy and (ii) we did not know which Muribaculaceae strains were present in the community of untreated mice, and thus for which microbes to design primers.Therefore, to minimize the chances of off-target amplification, we selected a comprehensive list of 156 non-targets and 11 intended Muribaculaceae targets (Table S1).The non-targets were selected among prevalent gut commensal bacteria to represent members of the phyla Bacillota, Bacteroidota, Pseudomonadota, Actinomycetota, Verrucomicrobiota, and Fusobacteriota.Using this 167-member input community, we were able to identify a gene shared across all Muribaculaceae but absent in all other 156 members.Importantly, since Muribaculaceae is a highly diverse family (43), we designed eight primer pairs on this gene to account for the higher potential representation of different sub-taxa in the complex community and pooled them to create a Muribaculaceae-specific primer cocktail.We validated this mix against both Muribaculaceae-positive and Muribaculaceae-depleted fecal samples and evaluated its specificity (Fig. 3E; see Materials and methods).Our results show that Muribaculaceaespecific primers selectively detected Muribaculaceae in the positive samples but did not in the depleted ones (Fig. S3), consistent with the results from 16S rRNA sequenc ing (Fig. 3F).In addition to being specific, these Muribaculaceae-specific primers did not display any off-target amplification of PCR amplicons with different sizes in the untreated samples (see Fig. S3).This is crucial in PCR assays such as qPCR and ddPCR, where quantification may be biased by unintended amplification.Altogether, we define a methodology involving the use of PUPpy to design taxon-specific primers that can selectively amplify microbial targets even in a complex community.

Taxon-specific primers enable high-resolution absolute microbial quantifica tion
Beyond the detection of the presence and absence of microbes, taxon-specific primers can also be used to quantify absolute microbial counts at extremely high resolution using qPCR or ddPCR (24,29).To evaluate microbial quantification with PUPpy, we compared quantification by PUPpy-designed primers via ddPCR to short-read 16S rRNA and shotgun sequencing.We pooled 10 extracted gDNA samples for each member of the GUT and SSS communities, respectively, in a theoretical 1:1 genomic copy ratio calculated from the respective microbial gDNA concentration and genome size (see Materials and methods).We then aliquoted the gDNA pools for analysis with 16S rRNA sequencing, shotgun sequencing, and ddPCR with PUPpy primers (Fig. 4A; see Materials and methods).Using ddPCR, we calculated absolute microbial abundance, converted it to compositional data, and compared it to the relative abundance estimated by the sequencing methods.In the GUT community, ddPCR and shotgun sequencing achieved similar resolution, which, as expected, was markedly higher than 16S rRNA sequencing due to the limited variability in the V3-V4 regions targeted in the latter technique (Fig. 4B).Specifically, amplicon sequencing classified only four members at the species level (Akkermansia muciniphila, Eubacterium rectale, B. ovatus, and B. thetaiotaomicron) and the remaining six members at the genus level (Fig. 4B).Conversely, both shotgun sequencing and ddPCR identified all 10 members at the greatest taxonomic resolution.In addition, ddPCR provided a more quantitative assessment of bacterial composition by measuring absolute microbial counts, which is inherently more accurate than relative abundance.
ddPCR also yielded accurate and absolute quantification in the genetically related SSS members (Fig. 4C).This assay selectively detected and quantified absolute counts for all SSS community members, except E. coli K-12 MC4100, for which we did not design primers given its similarity within the community.Similar to the results for the GUT community, 16S rRNA sequencing achieved the lowest resolution, only detecting the three major taxa and no individual strains, while shotgun sequencing detected all 10 microbes (Fig. 4C).Although shotgun sequencing identified all individual members, its quantification accuracy was impacted by the high degree of genetic similarity within the community (Fig. S4).In these scenarios, samples contain a substantial proportion of ambiguous reads (reads that map equally well to multiple targets in the reference database).Such reads can be either tossed out, assigned to the Lowest Common Ancestor (LCA), or randomly distributed across multi-mapping microbes (46).The fate of ambiguous reads varies with the goal of the study, pipelines used, and downstream applications.In the SSS community, the LCA approach would not achieve the resolu tion desired, and tossing out reads removed ~60% of them, heavily underestimating microbial abundance in strains and substrains (Fig. S4).Thus, we instructed Bowtie2 (47) to randomly distribute ambiguous reads and quantified microbes with coverM (48), which yielded a near uniform quantification of strains and substrains (Fig. 4C).However, we found this to be an artifact of randomly assigning multimapping reads, which coincidentally matched the expected input and would not extend to scenarios where strains and substrains are present in uneven proportions.
To further investigate the impact of multi-mapping reads for communities of arbitrary microbial concentrations, we generated in silico reads for the SSS members at known input proportions and analyzed these synthetic reads using both the heuristic alignment approach used in Fig. 4 and Kraken2 (49), an LCA approach (see Materials and methods for details).Unlike the heuristic alignment, this approach does not attempt to randomly assign ambiguous reads to multimapping targets and instead assigns them to the LCA.We did not apply Bracken (50) to Kraken2 data to avoid collapsing quantification to the species level, which would reduce the resolution needed to quantify the B. thetaio taomicron strains and E. coli substrains (Fig. S4).Both the heuristic alignment and LCA methods accurately quantified the three Enterocloster species, confirming the accuracy of shotgun sequencing when microbes are sufficiently distinct for reads to be confidently assigned (Fig. 5A and B).However, quantification accuracy with both pipelines gradually decreased as genetic similarity increased.Notably, the observed abundance obtained with the LCA approach for all strains and substrains was considerably lower than with heuristic alignment, consistent with the random distribution of ambiguous reads by the latter method (Fig. 5A and B).To further support this, we found that quantification with Kraken2 was considerably more accurate at the LCA level as opposed to the microbe level for both strains and substrains (Fig. 5C and D; Fig. S4).Specifically, the ratio of observed vs expected relative abundance, expected to be one in perfect scenarios, was ~0.8 for the E. coli LCA (Fig. 5C), while it dropped to between ~0.05 and ~0.001 for individual E. coli substrains (Fig. 5D).These data suggest that most reads belonging to closely related strains could not be unambiguously mapped with the methods tested and were randomly distributed to multi-mapping targets instead.These findings thus confirm that the shotgun sequencing quantification pipeline used in Fig. 4 was not actually able to accurately quantify individual strains and substrains in the SSS com munity and instead yielded a uniform abundance by randomly distributing ambiguous reads (Fig. 4C).Altogether, these results shine light on the possible pitfalls of short-read sequencing and highlight the need for careful considerations on the pipelines and parameters used.Additionally, these results show that PUPpy-designed taxon-specific primers enable accurate detection and absolute quantification in defined microbial communities, even in the presence of highly related microbes.

DISCUSSION
In this study, we introduced PUPpy, a new computational pipeline to design taxon-spe cific primers in microbial communities (Fig. 1).We benchmarked applications of PUPpy in microbiome research to detect and quantify absolute bacterial counts at high resolution in both defined and complex communities (Fig. 2 to 4).Our results show that PUPpydesigned taxon-specific primers can detect microbes down to the substrain level in defined bacterial communities (Fig. 2; Fig. S1 to S3).Importantly, we also used PUPpydesigned primers to selectively target taxa in complex communities, even when the full membership was unknown (Fig. 3; Fig. S3).Finally, our data suggest that microbe-specific primers can be used in ddPCR to quantify absolute microbial counts, achieving greater accuracy and resolution than short-read 16S rRNA and shotgun sequencing in defined bacterial communities (Fig. 4 and 5).
As a software package, PUPpy was specifically developed to prioritize user-friend liness, with minimal manual handling and command-line knowledge needed.The pipeline can be executed in two simple commands either from a terminal or a dedicated GUI, taking CDS files as input, and providing a table with taxon-specific primers and key parameters as output.Unlike other primer design tools, PUPpy does not require configuration files, streamlining execution to reduce chances of human error, while also remaining customizable.Users can leverage the flexibility and scalability of PUPpy to design taxon-specific primer sets rapidly and easily for a wide range of custom microbial communities.This is achieved due to its alignment-based strategy with MMseqs2 (34), which enables both rapid homology search across all genes in the community and low computational resources, making the pipeline suitable for both computing clusters and personal computers.In the current study, we successfully designed taxon-specific primers on both a computing cluster and personal computer in 2-10 minutes for each 10-member community and under 1 hour for a 167-member complex community on a cluster (see Materials and methods for details).This flexibility allows users to increase, if needed, the number of both targets and non-targets provided to adjust the confidence in primer specificity.In addition, the scalability of PUPpy enables the design of numerous primer sets, also on different genes, for each target, providing multiple ways to confirm specificity in custom microbial communities.
Importantly, increasing the community size, as well as the genetic similarity of its members, makes primer design more challenging.In these communities, the alignment step may not identify any genes sufficiently different to be binned as unique, for instance in the case of E. coli MC4100 in the SSS community (Fig. 3B).Users can adjust the alignment stringency, and thus the number of alignments reported, in "puppy-align" by modifying four key parameters: (i) minimum alignment identity, (ii) minimum align ment length, (iii) minimum alignment coverage, and (iv) coverage mode.Lowering the alignment identity, length, and coverage increases stringency by reporting more alignments, which decreases the total number of unique genes found and increases the specificity of taxon-specific primers.Conversely, raising these values lowers the alignment stringency and increases the number of genes considered to be unique, which may be beneficial in communities where no microbe-specific primers could be designed with default parameters.However, raising the minimum alignment identity, length, and coverage also increases the risk of identifying false positive unique genes.These parameters should thus be manipulated with caution and only if necessary, such as the SSS community which contains, or is expected to contain, multiple genetically related microbes.Despite having relaxed the alignment stringency, PUPpy could not identify truly unique genes for E. coli MC4100 but only false positives.This highlights the possibility of achieving greater resolution in shotgun sequencing with appropri ate sequencing depth and databases, which may not always be possible with PUPpydesigned PCR primers.Altogether, independently of the manipulation of alignment parameters, experimental validation is essential and always recommended to ensure the proper function and specificity of the primers designed.
Beyond primer design in defined communities, we successfully detected a bacterial family with high specificity within a complex microbiota, even without a priori knowl edge of the microbial membership.This can be a challenging task due to the unknown gene pool, limiting the ability to confirm primer selectivity in the alignment stage.Nevertheless, in this study, we were able to design a Muribaculaceae-specific primer cocktail for a conventional mouse gut microbiome by using a list of 156 expected non-targets and 11 intended Muribaculaceae targets.Providing an exhaustive input of non-targets is beneficial to design taxon-specific primers in undefined communities as PUPpy is more likely to identify conserved genes among the target taxa, decreasing the chances of false positive hits even in a complex community.The flexibility and scalability of PUPpy are well suited for this approach, supporting numerous CDS files as input without reaching prohibitive computational requirements.In addition, PUPpy enables users to scale primer design for multiple genes of the same microbial target.This could benefit longitudinal investigations in complex communities where gene transfers, loss, or gain could occur.Currently, PUPpy does not directly perform a phylogenetically informed gene choice prior to primer design and instead relies on the non-target list to identify conserved markers among the targets.If needed, users are encouraged to explore phylogenetic tools such as Phylomark (52) or learn about the gene selected by PUPpy prior to their investigations to ensure that the selected gene is ideal for the application of interest.
Finally, we have shown the potential of PUPpy-designed microbe-specific primers to accurately quantify absolute microbial counts in defined bacterial communities, even in the presence of genetically related members.Absolute microbial counts can be converted to compositional data, providing comprehensive insights into the dynamics of both individual members and the community.The opposite conversion, from relative to absolute, cannot be performed, confirming the key role of ddPCR and qPCR in quantitative microbial investigations.In these studies, designing and testing primers on multiple genes of the same microbe may be beneficial to ensure that genes with a single copy number are being investigated, as these could skew abundance estimation through quantitative PCR approaches.By design, PUPpy does not select genes present in multiple copies when complete microbial assemblies are provided.However, this may fail when scaffold or contig-level assemblies are used as input, as identical genes present more than once may be collapsed to the same location in the genome.Thus, users are encouraged to design multiple primers or confirm copy number prior to investigations focused on highly accurate microbial quantification.
In summary, PUPpy designs taxon-specific primers in diverse microbial communities and enables a fast, user-friendly, and scalable solution to profiling the microbiota.As microbiome research continues to uncover functional diversity beyond the species-level, the ability to detect microbes quickly and accurately below the species level will support high-resolution characterizations of microbial ecosystems that are required for the field to advance.

MATERIALS AND METHODS
Reagents and resources are listed in Table 1.

Bacterial strains and culture conditions
Metadata regarding all bacteria used in this study, including taxonomy, sources, and individual culture conditions, can be found in Table S1.All bacteria used in this study were cultured anaerobically with the following atmosphere: 5% H 2 , 5% CO 2 , and 90% N 2 (Linde Canada, Delta, BC, Canada) at 37°C.All media were pre-reduced in an anaerobic chamber (Coy Laboratories, Grass Lake, MI, USA) for at least 24 hours prior to use.Media names, abbreviations, ingredients, and recipes can be found in Table S3.All microbes were streaked onto their respective media plates from glycerol stocks stored at −80°C, followed by inoculation of single colonies into 3 mL of liquid media.Culture conditions and incubation times for individual microbes can be found in Table S1.

Technical specifications
All analyses involving PUPpy were performed on UBC ARC Sockeye, a high-performance computing platform, running an Intel Xeon Silver 4110 (2.1GHz) processor with 16 cores, 192 GB DDR4-2666 ECC RAM, and 2 TB SATA hard disk drive.PUPpy workflows were replicated on a consumer laptop with MacOS Ventura operating system, 10-core CPU with eight performance cores, 16-core GPU, 16 GB RAM, and 512 GB solid-state drive to validate the pipeline's viability under lower computational resources.

Executing the PUPpy pipeline
The PUPpy pipeline can be executed from the terminal by first running "puppy-align, " followed by the "puppy-primers" script.Alternatively, it is possible to execute both commands through a GUI by running "puppy-GUI" and interacting with a button-based system.See below and the PUPpy GitHub repository (https://github.com/Tropini-lab/PUPpy) for more detailed instructions and execution guidelines.

Sequence alignment of input coding sequence files
The first command of PUPpy, "puppy-align, " performs many-against-many local pairwise sequence alignment of all user-provided CDSs in a microbial community with MMseqs2 (34).This script requires a directory containing the CDS files of the targets for which primers should be designed.Optionally, users can provide a directory with the CDS files of non-targets, which will be exclusively used for specificity checks of the intended targets.Importantly, all CDS files provided as input in the "puppy-align" script will be aligned to ensure the specificity of all taxon-specific primers.The actual design of primers occurs in the "puppy-primers" scripts; thus, the CDS file choice in "puppy-align" is exclusively meant to indicate which microbes are present in the defined community of interest.CDS files provided by users must meet the following criteria: (i) FASTA headers must be provided in Prokka, RAST, or NCBI formats, (ii) filenames must contain the string "cds" prior to the extension, and (iii) filenames must end with the file extension ".fna".The command "puppy-align" prepends microbe identifiers prior to the string "_cds" in the CDS filenames to all FASTA headers, which is required for downstream parsing.For improved readability, users are encouraged to name CDS files with easily interpretable and unique microbe identifiers prior to the string "_cds" in the filename.Examples of acceptable CDS filenames include "B_theta_VPI5482_cds.fna" and "B_thetaiotaomi cron_VPI_5482_cds_from_genomic.fna".Following file renaming, PUPpy concatenates all input CDS files (from the primer target and non-target directories) into one file, which is used to create both the query and target databases.Next, PUPpy implements "mmseqs search" from MMseqs2 (34) (v14.7e284) with default parameters to align the query database against the target database, yielding the output file "ResultDB.tsv"with alignments of every CDS against every other CDS in the input community.The alignment percentage identity threshold of "mmseqs search" can be modified from its default value of "-p 0.3" in "puppy-align" to increase or decrease the alignment stringency.Providing "-p" >0.3 decreases the alignment stringency and thus increases the chances of finding unique genes.This is not recommended unless the input community is highly related, and no unique genes can be found with default parameters, as it will increase the chances of unspecific primer amplification.

Taxon-specific primer design
The second command of PUPpy, "puppy-primers, " designs taxon-specific primers for user-defined primer targets of the input community (Fig. 1A)."puppy-primers" expects the following inputs: (i) the key output file of puppy-align, "ResultDB.tsv," and (ii) a containing the CDS files of the primer targets PUPpy should design taxon-spe cific primers for (Fig. 1A).As default, "puppy-primers" design microbe-specific primers (Fig. 1B).For example, given a three-member bacterial community composed of B. thetaiotaomicron, B. ovatus, and E. coli, a microbe-specific primer set would only amplify B. thetaiotaomicron but no other microbe.Users can request PUPpy to design group-specific primers by adding the flag "-p group" (Fig. 1B).For example, given the same three-member community above, a Bacteroides-specific group primer set would amplify both B. ovatus and B. thetaiotaomicron but not E. coli.Depending on the mode, "puppy-primers" parses the alignment file to identify candidate CDSs to design microbe-or group-specific primers.CDSs that only align to themselves are considered for microbe-specific primers, while CDSs present in all intended targets are considered for group-specific primers."puppy-primers" orders candidate genes in decreasing length and then utilize Primer3 (35) to design primers while giving the flexibility to adjust key primer design parameters either through the terminal or GUI.A comprehensive list of all modifiable Primer3 primer design parameters, including all default values, is available by running "puppy-primers -h".The key output of PUPpy "puppy-primers" for microbe-and group-specific primers is "UniquePrimerTable.tsv"and "GroupPrimerT able.tsv",respectively.Both files contain information about the identified gene name, gene sequence, primer pair penalty scores, amplicon size, forward and reverse primer sequence, length, GC content, and melting temperature for all the user-defined primer targets.Running "puppy-primers" on microbe-specific mode also yields the barplot "UniqueGenesPlot.pdf",which shows the number of unique and non-unique genes found for every primer target of the input community.

DNA extraction and quantification
Bacterial gDNA was extracted using the DNeasy Blood and Tissue Kit (QIAGEN) and quantified with three technical replicates using the Quant-iT 1× dsDNA HS (High-Sen sitivity) Assay (Invitrogen), following the manufacturer's instructions.Microbial gDNA measured concentrations in nanogram per microliter were converted to copies per microliter using the formula below, where X is the amount of DNA in nanogram per microliter, and N is the genome size in base pairs (bp) as reported by the NCBI (Table S1).

Primer design and specificity validation
All taxon-specific primers used in this study were designed with PUPpy (see Table S2) and ordered with Thermo Fisher Scientific (https://www.thermofisher.com/order/custom-standard-oligo/). Custom Standard DNA Oligos were ordered dry, desalted, and with a synthesis scale of at least 25 nmole.

GUT and SSS defined communities
Taxon-specific primers, and their respective key parameters, designed for the GUT and SSS communities can be found in Table S2.Microbe-specific primers for both the GUT and SSS communities were designed separately by running "puppy-align" with default values, followed by "puppy-primers" with the parameters "-s 125 175 -optm 62 -tmd 1.5".Group-specific primers targeting the three major taxa of the SSS community were designed in three distinct runs by running "puppy-align" with default values, followed by "puppy-primers" with the parameters "-p group -pr <taxon_CDSs> -s 125 175 -optm 62 -tmd 1.5, " where taxon_CDSs were the CDS files of the (i) Enterocloster species, (ii) B. thetaiotaomicron strains, and (iii) E. coli substrains.Primer specificity was experimentally validated by PCR and gel electrophoresis.Each taxon-specific primer in the GUT and SSS communities was validated against four conditions: (i) the gDNA(s) targeted by the primer pair, (ii) the 10-member pool gDNA, (iii) the 10-member pool without the target gDNA(s), and (iv) a no-template control (water).The 10-member GUT and SSS communities were made by pooling an equal ratio of genomic copies of the respective microbial gDNA samples (quantified as above).

Muribaculaceae complex community
The Muribaculaceae-specific primer cocktail was created by pooling microbe-specific primer pairs targeting individual Muribaculum members (see Table S2).To maximize the chances of specificity within a complex community, microbe-specific primers targeting Muribaculum members were designed against a comprehensive input community of 167 gut commensals (including the target Muribaculaceae) available in the Tropini Strain Library (see Table S1).Although all Muribaculaceae-specific primers target the same CDS across intended targets, distinct primers were designed to account for base pair differences that may have affected primer annealing.PUPpy was used to iden tify a gene shared across all Muribaculaceae members, and individual primers were manually designed to maximize specificity across all target members.The specificity of the Muribaculaceae-specific primer cocktail within a complex community was valida ted using a Muribaculaceae-depleted and -positive conventional mouse model (45).Conventional mice were treated with 15% (wt/vol) osmotic laxative PEG for 7 days.Following PEG administration, fecal samples from both treated (Muribaculaceae-deple ted) and untreated (Muribaculaceae-positive) mice were collected, and total DNA was extracted from fecal contents using DNeasy 96 PowerSoil Pro QIAcube HT Kit (QIAGEN, Germantown, MD, USA), following the manufacturer's protocol.Muribaculaceae presence or absence in both samples was confirmed by 16S rRNA sequencing.The Muribacula ceae-specific primer cocktail was experimentally validated by PCR and gel electrophore sis against the following four conditions: (i) 11 distinct Muribaculum gDNA samples, (ii) no-template control (water), (iii) Muribaculaceae-depleted samples, and (iv) Muribacula ceae-positive.All PCR reactions used to experimentally validate primers contained 7.5 µL of 2× DreamTaq Green PCR Master Mix (Thermo Fisher), 5 µL of NF-H 2 O, 0.75 µL of forward and reverse primers (final concentration of 340 nM), and 1 µL of DNA template.PCR amplification was performed on a C1000 Touch Thermal Cycler (Bio-Rad) with the following program: 98°C for 2 min, 40 cycles of (98°C for 30 s, specific_annealing_T for 30 s, and 72°C for 15 s), 72°C for 5 min, and 4°C hold.The annealing temperature of each primer pair (specific_annealing_T) can be found in Table S2.Extension time was decreased from 1 minute to 30 seconds to avoid unspecific amplification.Following amplification, 8 µL of PCR product was run on a 1.5% agarose gel stained with SYBR Safe DNA Gel Stain (Invitrogen) at 90V for 60 minutes.Primer specificity was assessed by visually inspecting gels for the presence or absence of bands (Fig. S1 to S3).

Shotgun and 16S rRNA sequencing and analyses
Library preparation and sequencing for the defined communities were performed by the Sequencing + Bioinformatics Consortium (SBC) at the University of British Columbia (Vancouver, Canada).Extracted DNA was quantified using Qubit fluorometry.Shotgun metagenomic libraries were then prepared using the Illumina DNA prep library preparation kit (Illumina, San Diego, CA, USA).These libraries were sequenced on a NextSeq Mid output flow cell, generating paired-end 150 bp reads.For 16S sequencing, the V3 and V4 regions of the 16S rRNA gene were PCR amplified using the following primers (75): F: 5′-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCC TACGGGNGGCWGCAG and R: 5′-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTAC HVGGGTATCTAATCC.These amplicons were then converted to sequencing libraries using an eight-cycle indexing PCR with Nextera XT primers (Illumina, San Diego, CA, USA).These libraries were sequenced on a MiSeq v2 flow cell (Illumina, San Diego, CA, USA) to generate paired-end 250 bp reads.Raw base call data (bcl) were converted into fastq format using the bcl2fastq conversion software from Illumina.All computational analyses were performed on the UBC ARC Sockeye high-performance computing platform.
16S library preparation (76) for the Muribaculaceae-depleted and -positive samples was performed at Gut4Health, BC Children's Hospital Research Institute, Vancouver, BC, Canada.The V4 region was amplified using the following sequences: F: 5′-TATGGTAA TTGTGTGYCAGCMGCCGCGGTAA and R: 5′-AGTCAGTCAGCCGGACTACNVGGGTWTCTAAT.Amplicon libraries were purified, normalized, and pooled using the SequalPrep normalization plate (Applied Biosystems).Library concentrations were verified using a Qubit double stranded DNA (dsDNA) high-sensitivity assay kit (Invitrogen) and KAPA Library Quantification Kit (Roche) following manufacturer details.The purified pooled libraries were submitted to the Bioinformatics + Sequencing Consortium at UBC which verifies the DNA quality and quantity using an Agilent high-sensitivity DNA kit (Agilent) on an Agilent 2100 Bioanalyzer.Sequencing was performed on the Illumina MiSeq v2 platform with 2 × 250 paired end-read chemistry.

In silico Illumina reads generation and data analysis
In silico Illumina reads for the SSS community were simulated with InSilicoSeq (51) (v1.5.4) to achieve known input proportions, using the following parameters: "--mode basic --draft <SSSgenomes > --coverage <InputCoverage > n_reads 1.1M".The genomes of all SSS members were used to generate in silico reads at desired input proportions, determined by coverage.To achieve a wide range of input coverages for each member, all five default "--coverage" values offered by InSilicoSeq (51) were used: uniform, half normal, lognormal, exponential, and zero_inflated_lognormal. Relative abundance from these in silico Illumina shotgun sequencing reads was calculated using both heuristic alignment and LCA approaches, as described above.

Quantification of absolute microbial counts
Absolute quantification of microbial counts, measured in copies per microliter, for both the 10-member GUT and SSS pools, was achieved with ddPCR.All ddPCR reactions were run at Gut4Health (RRID:SCR_023673).The same microbe-specific primers validated for the GUT and SSS communities were used in ddPCR (see Table S2).The respective nine-member pools for each primer pair, lacking the intended target, were run as negative controls.Absolute microbial counts were converted to relative abundance by dividing the copies per microliter of each organism by the sum of copies per microliter of all organisms in the community.Microbial relative abundance measured by ddPCR in the SSS community was calculated from 9 organisms instead of 10 as we could not design microbe-specific primers for E. coli MC4100.
Briefly, the 10-member GUT and SSS gDNA pools were diluted in NF-H 2 O (Fisher Bioreagents) to achieve an expected concentration of the target DNA between 0 and 5,000 copies/μL.Every ddPCR reaction was prepared in a semi-skirted ddPCR 96-well plate (Bio-Rad) and contained 11 µL 2× QX200 ddPCR EvaGreen Supermix (Bio-Rad), 0.45 µL of forward and reverse primers (final concentration of 204 nM), 8.1 µL of NF-H 2 O, and 2 µL of the diluted gDNA pool.The ddPCR plate was placed in an Automated Drop let Generator (Bio-Rad), together with DG32 Automated Droplet Generator Cartridges (Bio-Rad), Pipet Tips for AutoDG system (Bio-Rad), and Automated Droplet Generation Oil for EvaGreen (Bio-Rad).Following droplet generation, the ddPCR plate was sealed with pierceable foil (Bio-Rad) using a PX1 PCR Plate Sealer (Bio-Rad).After sealing, the ddPCR plate was placed in a C1000 Touch Thermal Cycler (Bio-Rad) and amplified using the following program: 95°C for 5 min, 40 cycles of (95°C for 30 s and 62°C for 1 min), 4°C for 5 min, 90°C for 5 min, and 4°C hold.All changes in temperature were limited to 2 °C/s ramp rate.Droplet reading was performed with a QX200 Droplet Reader (Bio-Rad) using ddPCR Droplet Reader Oil (Bio-Rad).Manual selection of the amplitude threshold and data analysis were performed on the QX Manager v2.0 software (Bio-Rad).

Average nucleotide identity analyses
ANI analyses for the GUT and SSS communities were performed using the ANIb method of pyani (56) (v0.2.9).The CDS files of E. coli BW25113, MG1655, and DH10B were concatenated with the "cat" command and provided as pyani input (-i flag) together with the E. coli MC4100 CDS file.The ANI outputs were imported into RStudio, and heatmaps were plotted using ggplot2.

High molecular weight genomic DNA extraction
M. intestinale G6 was cultured following the conditions outlined in Table S1.High molecular weight gDNA was isolated using a Genomic-tip 100/G Kit (QIAGEN) following the manufacturer's protocol.Extracted DNA was allowed to relax at 4°C overnight and was then quantified using a Quant-iT 1× dsDNA HS Assay Kit (Invitrogen) and quality controlled using spectrophotometry.DNA integrity was confirmed on a 1% agarose gel.

FIG 2 FIG 3
FIG 2 PUPpy-designed microbe-specific primers selectively detect microbes in a community of 10 phylogenetically distinct bacteria.(A) Pairwise ANI based on the whole-genome sequences of the 10 GUT community members (Materials and methods).Color indicates percentage identity, with darker red indicating greater similarity between sequences.Gray circles indicate a percentage identity lower than 70%.Circle size indicates alignment coverage between sequences, here the percentage of two genomes that are aligned.(B) PUPpy-generated bar plot showing the number of unique genes (to the right of the strain name) identified for each member of the input community.The high number of unique genes found across GUT community members, ranging from 1,248 in B. thetaiotaomicron (~26% of the CDSs) to 3,269 in E. coli BW25113 (~74% of the CDSs), reflects the genetic diversity observed by ANI (A).(C) Experimental conditions for validation of microbe-specific primers specificity in the GUT community.The PCR gel image exemplifies the specificity of microbe-specific primers tested using the conditions above for Faecalibacterium prausnitzii.(D) Binary heatmap showing the targets amplified by each microbe-specific primer pair following the validation in (C).Each primer pair exclusively amplified the respective intended target (black squares along the diagonal) following primer optimization.Individual gel imaging data can be found in Fig. S1.

FIG 4
FIG 4 PUPpy-designed taxon-specific primers enable more accurate and resolved quantification than 16S rRNA and shotgun sequencing in defined communi ties.(A) Schematic of the workflow to prepare samples for quantification (see Materials and methods section for details).(B and C) Stacked bar plots evaluating microbial quantification with ddPCR, 16S rRNA, and shotgun sequencing in the GUT (B) and SSS (C) communities.ddPCR absolute counts were converted to relative abundance for a more direct comparison to the other methods (see Materials and methods).

FIG 5
FIG 5 Shotgun sequencing quantification of genetically related microbes by heuristic alignment inaccurately estimates relative abundance due to the random assignment of multimapping reads.(A and B) Scatter plot of observed vs expected relative abundance from in silico Illumina reads of the SSS community members, generated with InSilicoSeq (51).Relative abundance was calculated using a heuristic alignment-based approach with Bowtie2 (47), followed by CoverM (48) (A) and an LCA approach with Kraken2 (49) (B) to evaluate different shotgun sequencing quantification pipelines (see Materials and methods for details).The dotted line represents a 1:1 slope.The blue line and gray shaded area indicate a linear regression fitted on all data points from each taxon.(C and D) Boxplot of the ratio of observed and expected relative abundance for each LCA (C) and individual member (D) in the SSS community, calculated using the Kraken2 (49) pipeline.

TABLE 1
Resource table